feat(runtime): add RuntimeTracer trait for task instrumentation#2467
feat(runtime): add RuntimeTracer trait for task instrumentation#2467xanderbailey wants to merge 3 commits into
RuntimeTracer trait for task instrumentation#2467Conversation
Allows callers to inject observability (tracing spans, metrics, etc.) into spawned tasks without modifying spawn sites. Inspired by DataFusion's JoinSetTracer but scoped per-Runtime rather than global, enabling different tracers for IO vs CPU handles.
|
@CTTY would be interested on your thoughts here? |
|
Hi Xander, Thanks for having this! Curious to hear what others think as well |
|
Makes sense, I thought about this too and this trait is designed to be agnostic of the concrete runtime (tokio) in the hope that it wouldn’t need to be changed. It just generally wraps futures. Is the worry that we don’t know if that’s actually true until we design a runtime trait? |
| self.io.tracer = Some(tracer.clone()); | ||
| self.cpu.tracer = Some(tracer); |
There was a problem hiding this comment.
Why is the io.tracer cloned, but the cpu tracer isn't?
There was a problem hiding this comment.
You need it twice so the first is a "move" but clone of Arc is very cheap
| /// The implementation must not alter the closure's return value. | ||
| fn trace_block( | ||
| &self, | ||
| f: Box<dyn FnOnce() -> Box<dyn Any + Send> + Send>, |
There was a problem hiding this comment.
Because this is blocking do we really need the Send trait?
There was a problem hiding this comment.
Tokio spawn_blocking requires Send because you pass these to thread pool so you do cross a thread boundary. https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.html
| ) -> impl FnOnce() -> T + Send + 'static | ||
| where | ||
| F: FnOnce() -> T + Send + 'static, | ||
| T: Send + 'static, |
There was a problem hiding this comment.
Same as above why do we need Send when this is blocking?
|
Thanks @xanderbailey for this pr, but I'm not convinced that this is the right direction to go. The motivation you mentioned have two parts:
|
|
The strongest example is span propagation across spawn boundaries. When you tokio::spawn, the tracing span context is severed, the spawned task doesn't inherit the parent span. So if you're tracing a table scan that spawns 50 IO tasks to fetch manifests and data files, those tasks appear as orphans in your Jaeger/Tempo traces rather than children of the scan. Very similar conversation was had when this was introduced in datafusion apache/datafusion#14547 |
Allows callers to inject observability (tracing spans, metrics, etc.) into spawned tasks without modifying spawn sites. Inspired by DataFusion's JoinSetTracer but scoped per-Runtime rather than global, enabling different tracers for IO vs CPU handles.
Why
RuntimeTracerinstead of tokio's native tracing?iceberg-rust is a library, it shouldn't dictate the observability stack.
RuntimeTracerprovides a spawn-site hook that lets consumers inject their own instrumentation without coupling the library to any specific tracing crate.RuntimeTracertracingcrate in the librarytracingecosystem (subscribers, layers)tracing, OTel, Prometheus,log, or anything elseRuntimeHandle, so IO and CPU work are distinguishableThe two approaches are complementary: a consumer could implement
RuntimeTracerto attach atracing::Span, getting tokio-console visibility and library-level spawn-site context without iceberg taking an opinion.Which issue does this PR close?
What changes are included in this PR?
Are these changes tested?